ABSTRACT

Distances in a network capture relations between nodes and are the basis of
centrality, similarity, and influence measures. Often, however, the relevance
of a node u to a node v is more precisely measured not by the magnitude of
the distance, but by the number of nodes that are closer to v than u, that is,
by the rank of u in an ordering of nodes by increasing distance from v.

We identify and address fundamental challenges in rank-based graph
mining. We first consider single-source computation of reverse-ranks and
design a “Dijkstra-like” algorithm which computes nodes in order of
increasing approximate reverse rank while only traversing edges adjacent to
returned nodes. We then define reverse-rank influence, which naturally
extends reverse nearest neighbors influence [Korn and Muthukrishnan 2000]
and builds on a well studied distance-based influence. We present
near-linear algorithms for greedy approximate reverse-rank influence
maximization. The design relies on our single-source algorithm. Our algorithms
utilize near-linear preprocessing of the network to compute all-distance
sketches. As a contribution of independent interest, we present a novel
algorithm for computing these sketches, which have many other applications,
on multi-core architectures.

We complement our algorithms by establishing the hardness of computing
exact reverse-ranks for a single source and exact reverse-rank influence.
This implies that when using near-linear algorithms, the small relative
errors we obtain are the best we can currently hope for.

Finally, we conduct an experimental evaluation on graphs with tens of 
millions of edges, demonstrating both scalability and accuracy. 

1. INTRODUCTION 

Shortest-path distances in a network are a classic measure of
the relation between nodes and are the basis of similarity,
centrality, and influence measures. Often, however, the relation
of a node j to i is more correctly modeled not by the magnitude of
the distance $d_{ji}$ from j to i, but by i’s position in an ordering
of nodes according to increasing distance from j [14, 24, 39, 20].
A classic use of rank as an indicator of relevance in metric spaces
is the k nearest neighbors (kNN) classifier, which classifies points
based on the k closest labeled examples [14, 24]. In terms of
popularity, kNN outweighs the respective distance-based classifiers,
which instead use all examples within a certain distance.

More formally, we view a node j as ranking other nodes according
to their distance order from j. The rank $\pi_{ji}$ is the position of i
in increasing order from j.¹ Accordingly, from the perspective of
node j, we can refer to $\pi_{ij}$ as a reverse rank.

An advantage of using rank is that it provides a different signal
than distance by “factoring out” the effects of uneven density. This
is illustrated in the toy social network in Figure 1. We expect node
A to be more important to node C than it is to node B, even though
A is closer to B than to C ($d_{CA} > d_{BA}$). This is because B has

¹ $\pi_{ji}$ is also termed the Dijkstra rank of i, since Dijkstra’s algorithm
from source j processes nodes in increasing distance.



Figure 1: Example undirected social network (edge lengths are 
proportional to drawn lengths). 


a dense neighborhood of nodes closer than A, but C has only two
nodes closer to it than A. In terms of ranks, we have $\pi_{CA} = 3$ and
$\pi_{BA} = 6$ and thus $\pi_{CA} < \pi_{BA}$, which reflects this intuition.

The rank relation is asymmetric: In the example network in Figure 1
we have $\pi_{BA} = 6$, since there are 5 nodes closer to B than
A, and $\pi_{AB} = 1$, since B is the closest node to A. Therefore,
$\pi_{AB} \neq \pi_{BA}$ even though the distance is symmetric
($d_{AB} = d_{BA}$). The asymmetry $\pi_{BA} > \pi_{AB}$ reflects our
intuition that the higher degree node (B) has more influence on its
neighbor (A) than the reverse.

In particular, with tie breaking on distances, a node v has exactly
one nearest neighbor, but can have zero to many reverse nearest
neighbors, which are the nodes u that satisfy $\pi_{uv} = 1$. In our example,
node A has no reverse nearest neighbors. The number of reverse
nearest neighbors of a point v is a well studied notion of v’s
influence, proposed by Korn and Muthukrishnan [27], and considered
in metric spaces and in graphs [28].

In the basic model, which is sometimes called monochromatic,
all nodes both rank and get ranked. A natural extension (the
bichromatic model [27]) allows only a subset of the nodes to get ranked
(rankees) and also permits nodes that provide ranks (rankers) to
have different weights. In this model, $\pi_{ij}$ relates a ranker i to a
rankee j. This distinction is useful when nodes have two or more types
of entities, for example, users (rankers) and content (rankees). It
also allows us to specify a special small set of certified rankees such
that we can characterize properties of other nodes by this smaller
set of ranks. Importance weights $\beta(i) \ge 0$ assigned to rankers can
correspond to properties like purchase power or trust level. The
ranks assigned by ranker i are then weighted by $\beta(i)$. This
weighting is useful when we aggregate the scores of multiple rankers to
obtain centrality/influence scores of rankees.

1.1 Contributions and Overview 

Rank-based measures provide a natural alternative to distance-based
ones, but algorithmically pose different challenges. We identify
and motivate fundamental challenges and present scalable
algorithmic tools that facilitate rank-based graph mining.

Reverse-rank Single-source computation

An important tool in working with distances is an efficient
single-source computation. Dijkstra’s algorithm from a source s computes
the distances $d_{si}$ for all nodes i in near-linear time. A powerful
property of Dijkstra’s algorithm is sorted access: Nodes are
revealed in order of increasing distance from s. Thus, for any k, the
k closest nodes to s are computed while traversing only edges
adjacent to these k nodes. Therefore, if we are only interested in a
prefix of the closest nodes, we can terminate the execution after
they are computed, performing a fraction of the computation of a
full execution. When we work with ranks, we would instead be
interested in a prefix of the highest ranks. Such sorted access to
rankings is also important for efficiently aggregating rankings [16].

The reverse-rank single-source problem is, for a node $i \in U$, to
compute the reverse ranks $\pi_{ji}$ with respect to all nodes j.
Moreover, we aim for an efficient algorithm that provides sorted access:
Listing nodes in increasing $\pi_{ji}$ order with an algorithm that only
traverses edges adjacent to listed nodes.

A naive solution for exact reverse-rank single-source computation
from i is to run Dijkstra’s algorithm from each node j, until
node i is processed. For the average node, this is equivalent to
performing n runs of Dijkstra until revealing on average n/2 nodes.
Note that even on sparse networks, this scales quadratically in the
number of nodes, which is prohibitive even on mid-size networks.
This is in sharp contrast to the shortest-path single-source
computation which takes (near) linear time. Previous work [28] proposed
ways to scalably identify the set of reverse nearest neighbors of
nodes, but did not address higher reverse ranks. We are able here
(in the hardness section) to provide an explanation, establishing that
the naive solution is in a sense the best we can do for the exact problem:
We leverage the theory of subcubic equivalence [38] and construct
a reduction from graph radius computation to reverse-rank
single-source computation. The former is known to have subcubic
equivalence to all-pairs shortest-paths computation (APSP).
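
To make the cost of the naive approach concrete, the following is a minimal Python sketch of it, not the authors' implementation. It assumes an adjacency-list graph `adj` (a hypothetical name), counts ranks excluding the ranking node itself, and resolves ties by counting every node at distance at most $d_{ji}$. Each call to `truncated_rank` is one truncated Dijkstra run, and `naive_reverse_ranks` performs one such run per node.

```python
import heapq

def truncated_rank(adj, j, i):
    """Run Dijkstra from j, stopping once all nodes at distance <= d(j, i)
    are settled. Returns how many nodes other than j lie within that
    distance (the rank of i as seen by j), or None if i is unreachable."""
    dist = {j: 0.0}
    heap = [(0.0, j)]
    settled = set()
    target = None  # d(j, i), known once i is settled
    count = 0
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        if target is not None and d > target:
            break  # everything left is farther from j than i is
        settled.add(u)
        if u != j:
            count += 1
        if u == i:
            target = d
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return count if target is not None else None

def naive_reverse_ranks(adj, i):
    """One truncated Dijkstra per node j: quadratic work on sparse graphs."""
    return {j: truncated_rank(adj, j, i) for j in adj if j != i}

def undirected(edges):
    adj = {}
    for a, b, w in edges:
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    return adj

# Illustrative star graph: hub B with three leaves, and A attached to B.
adj = undirected([("A", "B", 2.0), ("B", "X1", 1.0),
                  ("B", "X2", 1.0), ("B", "X3", 1.0)])
ranks_of_B = naive_reverse_ranks(adj, "B")  # every node ranks the hub first
ranks_of_A = naive_reverse_ranks(adj, "A")  # every node ranks A last
```

In this toy graph the hub B is ranked first by every other node, while A is ranked fourth by every other node, mirroring the asymmetry discussed in the introduction.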

This hardness result, fortunately, applies only to the exact problem.
An important contribution we make here (see Section 4) is
devising a novel, scalable, Dijkstra-like (sorted access), approximate
reverse-rank single-source algorithm, which provides estimates
$\hat{\pi}_{ij}$ with a small relative error. Since ranks are intrinsically
slightly noisy measures of the actual relations, estimates with small
relative errors are often as good as the exact values.

An essential component of our design is a preprocessing step
where we compute All-Distances Sketches (ADS) for all nodes.
The sketches provide us with a fast oracle which estimates
$\pi_{ij}$ from the distance $d_{ij}$. In Section 3 we review the sketches and
estimators as applied in our context. We note that we can apply
any ADS algorithm, and existing designs are suitable for sequential,
shared-memory, and node-centric message-passing computations
[6, 33]. A stand-alone contribution we make here
is engineering an ADS algorithm for multicore architectures which
provides a provable tunable tradeoff between overhead and concurrency.
Our algorithm can be used for many other applications of
the sketches, which include estimating distances, closeness similarity,
the distance distribution, and timed-influence.

The estimation of single-source reverse ranks can be done by
first running Dijkstra’s algorithm to compute the single-source
distances, and then applying the oracle we obtained in the preprocessing
step to the computed distances. This method, however, will not
provide us sorted access. Our sorted access algorithm, similarly to
Dijkstra, also traverses a shortest-path tree rooted at the source, but
critically, instead of doing so in distance order, which would violate
rank-based sorted access, does so in order of increasing estimated
reverse ranks. The correctness of our design relies on key insights
on properties of reverse ranks.

Reverse-rank influence 

Distance or reachability-based notions of centrality and influence
of a set S of seed nodes are fundamental measures in network
analysis. In the general form [11], distance-based influence is defined
with respect to a non-increasing decay function $\alpha(x) \ge 0$
(smoothing kernel) and node weights $\beta(i) \ge 0$. The contribution of each
node j to the influence of S is proportional to its weight $\beta(j)$ and
decays with the distance of j from S, $d_{Sj} = \min_{i \in S} d_{ij}$:

    $\mathrm{Inf}(S) = \sum_j \beta(j)\,\alpha(d_{Sj})$ .    (1)

Well studied special cases include Closeness centrality [3, 34, 12, 32],
where S contains a single node i, and the celebrated
reachability-based influence model of [26], obtained when $\alpha(x) = 1$
for finite x and 0 otherwise. Distance-based influence with a threshold
function $\alpha$ ($\alpha(x) = 1$ for $x \le T$ and $\alpha(x) = 0$ otherwise) was
studied in [22, 15] (with distance interpreted as elapsed time).²

Here we define reverse-rank influence

    $\mathrm{Inf}^{\pi}(S) = \sum_j \beta(j)\,\alpha(\pi_{jS})$ ,    (2)

where $\pi_{jS} = \min_{i \in S} \pi_{ji}$. The special case of
$\mathrm{Inf}^{\pi}(v)$, the influence of a single node, with $\alpha(1) = 1$ and
$\alpha(x) = 0$ otherwise, is the number of reverse nearest neighbors of v,
which is the influence measure proposed in [27, 28]. Our more flexible
definition (2) is able to account for the contribution of nodes with higher
reverse rank to the influence of our node. For example, by setting
$\alpha(x) = 1/x$ we achieve the effect that a reverse rank of x contributes
1/x to the total influence of v: A node for which v is the 5th closest
neighbor contributes to its influence 20% of what it would have contributed
as a reverse nearest neighbor. With $\alpha$ being a T-threshold function,
rankers u that rank v in their top T contribute $\beta(u)$ to v’s influence.
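
The definition can be evaluated directly once ranks are available. The sketch below is illustrative only: the nested dictionary `ranks` (with `ranks[j][i]` holding $\pi_{ji}$), the helper names, and the toy values are all assumptions made for this example, and unit weights are used when `beta` is omitted.

```python
def reverse_rank_influence(ranks, seeds, alpha, beta=None):
    """Sum over rankers j of beta(j) * alpha(min over seeds i of pi_ji)."""
    total = 0.0
    for j, row in ranks.items():
        pi_js = min(row[i] for i in seeds)      # best rank j gives any seed
        weight = 1.0 if beta is None else beta[j]
        total += weight * alpha(pi_js)
    return total

def harmonic(x):
    """Decay alpha(x) = 1/x."""
    return 1.0 / x

def threshold(T):
    """Decay alpha(x) = 1 if x <= T, else 0."""
    return lambda x: 1.0 if x <= T else 0.0

# Toy ranks: three rankers (u1..u3) ranking two rankees (a, b).
ranks = {
    "u1": {"a": 1, "b": 2},
    "u2": {"a": 2, "b": 1},
    "u3": {"a": 5, "b": 3},
}
inf_a_harmonic = reverse_rank_influence(ranks, {"a"}, harmonic)
inf_ab_harmonic = reverse_rank_influence(ranks, {"a", "b"}, harmonic)
inf_a_top2 = reverse_rank_influence(ranks, {"a"}, threshold(2))
```

With the harmonic decay, ranker u3 (for which a is the 5th closest) contributes 0.2 to the influence of a, matching the 20% example in the text.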

Reverse-rank Influence Computation: We show (in the hardness section) that
the computation of exact reverse-rank influence, even for a single
node, and even when $\alpha$ is a threshold function, has subcubic
equivalence to APSP. We therefore consider approximate influence
$\widehat{\mathrm{Inf}}$ computed using approximate ranks $\hat{\pi}$. Clearly
$\widehat{\mathrm{Inf}}(S)$ can be computed using |S| single-source approximate
reverse-rank computations (and using $\hat{\pi}_{jS} = \min_{i \in S} \hat{\pi}_{ji}$).
Surprisingly, however, we show in a later section that even with large |S|,
one single-source computation suffices.

Reverse-rank Influence Maximization: An important coverage
problem, which is extensively explored for reachability and
distance-based influence, is influence maximization (IM) [26]: For a given
$s \ge 1$, identify a set of s seed nodes with maximum influence.
Intuitively, such a set provides the best “coverage” for its size with
respect to the influence measure at hand. Here we consider IM
with respect to our reverse-rank influence function $\mathrm{Inf}^{\pi}$. The
reverse-rank IM problem with $\alpha$ being a threshold function with
parameter T on our example user and movies data set is to find a
set S of s movies which maximizes the number of users for which
there is at least one movie from S in their top T choices.

Similar to the distance-based influence function $\mathrm{Inf}$, $\mathrm{Inf}^{\pi}$
is monotone and submodular, and even for a simple threshold $\alpha$,
when s is a parameter, the IM problem is NP-hard. The most
common and hugely successful algorithm for such coverage problems
is the greedy algorithm, which iteratively builds a seed set by

² Reachability and distance-based influence were also explored as
the expectation when edge lengths/presence are probabilistic.

selecting in each step a node with maximum marginal contribution.
For submodular and monotone functions, greedy has the property
that each prefix of the sequence of size s has influence that is at
least $1 - (1 - 1/s)^s \ge 1 - 1/e$ of the influence of the optimal
seed set of that size. Exact greedy, however, does not scale
well for very large graphs. For reverse-rank influence, an exact
greedy sequence can be computed in cubic time in the number of
nodes. When all nodes are both rankers and rankees, the graph is
sparse, and we work with a threshold function $\alpha$ with parameter T,
the computation reduces to $O(nT)$, by performing a single-source
search from all nodes to find the T nearest neighbors and computing
a greedy cover. But even this special case does not scale well
for large graphs for larger values of T.
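
The greedy cover step in the threshold special case can be sketched as follows. This is a plain, unoptimized illustration rather than the scalable algorithm developed in the paper: it assumes the top-T lists have already been computed and are given as a hypothetical dictionary `top_lists` mapping each ranker to its T nearest rankees, and it breaks ties by rankee name only to make the output deterministic.

```python
def greedy_threshold_cover(top_lists, s):
    """Greedy seed selection for threshold reverse-rank influence.

    top_lists maps each ranker to the rankees in its top T. A ranker is
    covered by a seed set if at least one seed is in its top T.
    Returns (seed sequence, number of covered rankers)."""
    covers = {}  # rankee -> set of rankers that have it in their top T
    for ranker, top in top_lists.items():
        for rankee in top:
            covers.setdefault(rankee, set()).add(ranker)
    covered = set()
    seeds = []
    for _ in range(s):
        # Pick the rankee with the largest marginal gain.
        best = max(sorted(covers),
                   key=lambda v: len(covers[v] - covered),
                   default=None)
        if best is None or not (covers[best] - covered):
            break  # nothing left to gain
        seeds.append(best)
        covered |= covers[best]
    return seeds, len(covered)

# Toy instance with T = 2: four users and their two top-ranked movies.
top_lists = {
    "u1": ["m1", "m2"],
    "u2": ["m1", "m3"],
    "u3": ["m3", "m4"],
    "u4": ["m4", "m2"],
}
seeds, num_covered = greedy_threshold_cover(top_lists, 2)
```

Each round rescans all candidates, which is fine for a toy instance; lazy evaluation of marginal gains is the usual way to speed this up in practice.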

Approximate greedy and heuristics have been extensively studied
for reachability-based and distance-based influence.
In particular, the SKIM algorithm [10, 11] computes in
near-linear time a full greedy permutation so that each prefix of size
s has approximation ratio of $1 - (1 - 1/s)^s - \epsilon$.

In a later section we present a near-linear algorithm which computes
an approximate greedy sequence with respect to the approximate
reverse-rank influence objective with threshold function:
$\widehat{\mathrm{Inf}}(S) = |\{z \in Z \mid \hat{\pi}_{zS} \le T\}|$.
The algorithm we present builds on SKIM,
but incorporates critical adjustments that utilize the sorted access
property of our approximate reverse-rank single-source computations.

Experiments 

Our experimental evaluation, detailed in a later section, focused on
scalability and solution quality, using publicly available anonymized
social graph data sets. Our ADS implementation runs on graphs
with tens of millions of edges in tens of minutes on a single core,
providing estimates with NRMSE (normalized root mean square error)
of 6%-13%. Our multithreaded design achieved speedup factors of
3 to 4 on a machine with two CPUs and multiple cores.

With the preprocessing in place, our approximate reverse-rank
single-source computations have similar running time to Dijkstra’s
algorithm (which computes single-source distances). In particular,
a reverse-rank single-source computation was performed in less
than 15 seconds on a single core on a graph with $4 \times 10^6$ nodes and
$35 \times 10^6$ edges. This should be contrasted with the
running time of an exact reverse-rank single-source computation,
which would have taken an estimated 6000 hours on the same instance.

Using our implementation, we are able to visualize the reverse- 
rank distributions of some nodes in a large network, demonstrating 
how the distribution reveals information on the relative importance 
of a node in its locality. Prior to our work, it was not possible to 
scalably compute these distributions on large graphs. 

Our approximate greedy IM implementation computes the full 
sequence on graphs with tens of millions of edges in minutes. We 
also observe that for small graphs or small values of T, where we 
could compute an exact greedy sequence, the solution quality of 
our approximate sequence was very close to the exact one. 

2. PRELIMINARIES 

We introduce some necessary notation. For a numeric function
r over a set X, the function $k^{\mathrm{th}}_r(X)$ returns the kth smallest
value in the range of r on X. If $|X| < k$, we define $k^{\mathrm{th}}_r(X)$ as the
supremum in the range of r. If r is not specified, we return the kth
smallest value in X.

We work with networks modeled as directed or undirected graphs
$G = (V, E)$ with nodes $V = [n] = \{1, \ldots, n\}$ and edges E with
lengths $w(e) > 0$. We use $m = |E|$ for the number of edges. A
subset or all nodes $U \subseteq V$ are specified as rankee nodes. We use
the notation $G^T$ for the transpose graph, which is the graph with
edges reversed.

For nodes i, j, let $d_{ij}$ be the shortest-path distance from i to
j. For $y \ge 0$, the rankee y-neighborhood of i is the set of rankee
nodes within distance y from i. We denote the neighborhood by

    $N_i(y) = \{j \in U \mid d_{ij} \le y\}$

and its cardinality by $n_i(y) = |N_i(y)|$. We use the notation
$\underline{N}_i(y) = \{j \in U \mid d_{ij} < y\}$ for the respective strict
neighborhood and $\underline{n}_i(y)$ for its cardinality. For $i \in V$ and
$j \in U$, $\pi_{ij}$ denotes the rank of j with respect to i. When distances
are unique, we have $\pi_{ij} = n_i(d_{ij})$, that is, equal to the number of
rankee nodes that are at least as close to i as j. When distances are not
unique, we consider the range $[\underline{\pi}_{ij} + 1, \overline{\pi}_{ij}]$, where

    $\overline{\pi}_{ij} = n_i(d_{ij})$ ,    (3)
    $\underline{\pi}_{ij} = \underline{n}_i(d_{ij})$ .

According to what we want to capture, we can define the rank $\pi_{ij}$
as either $\overline{\pi}_{ij}$, $\underline{\pi}_{ij} + 1$, a uniform at random choice
from the range, or as the midpoint of this range:
$\pi_{ij} = (\underline{\pi}_{ij} + 1 + \overline{\pi}_{ij})/2$. Our algorithms
and implementation can be adapted to support all these choices.

An important ingredient of our design is the computation of a
data structure which allows us to efficiently estimate the number of
rankees in a neighborhood. That is, for a query specified by a node
i and $d \ge 0$, return $\hat{n}_i(d)$. The data structure can be viewed as a
set of lists L(i), one for each node $i \in V$. Each list L(i) consists
of pairs (d, y) where d is a distance value and $y = \hat{n}_i(d) > 0$ is an
estimate of $n_i(d)$. The lists are sorted and increasing in both d and
y. To estimate $n_i(d)$ and $\underline{n}_i(d)$ from the list L(i), we use

    $\hat{n}_i(d) = y$ such that $(x, y) = \arg\max_{(x,y) \in L(i) \mid x \le d} x$ ,    (4)

    $\hat{\underline{n}}_i(d) = y$ such that $(x, y) = \arg\max_{(x,y) \in L(i) \mid x < d} x$ .    (5)

That is, we look at the pair $(x, y) \in L(i)$ such that $x \le d$ (or
$x < d$) is maximum and return y. From the relations (3), we can
obtain estimates $\hat{\overline{\pi}}_{ij} = \hat{n}_i(d_{ij})$ and
$\hat{\underline{\pi}}_{ij} = \hat{\underline{n}}_i(d_{ij})$ from L(i) if we
know $d_{ij}$.
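
Since each list is sorted by distance, both lookups reduce to a binary search. The sketch below is one possible realization, assuming the list L(i) is stored as two parallel Python lists, `dists` and `ests` (hypothetical names), and that a query below the smallest stored distance returns 0.

```python
from bisect import bisect_left, bisect_right

def n_hat(dists, ests, d):
    """Estimate for the closed neighborhood: pair with the largest x <= d."""
    pos = bisect_right(dists, d)
    return ests[pos - 1] if pos > 0 else 0.0

def n_strict_hat(dists, ests, d):
    """Estimate for the strict neighborhood: pair with the largest x < d."""
    pos = bisect_left(dists, d)
    return ests[pos - 1] if pos > 0 else 0.0

# Illustrative list L(i): distances and the estimates attached to them.
dists = [0.0, 1.5, 3.0]
ests = [1.0, 2.0, 4.7]

closed_at_tie = n_hat(dists, ests, 1.5)         # uses the pair at x = 1.5
strict_at_tie = n_strict_hat(dists, ests, 1.5)  # uses the pair at x = 0.0
```

The two functions differ only at query distances that appear in the list, which is exactly where the closed and strict neighborhoods differ.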

The lists L{i) are computed from All-Distances Sketches ADS(i), 
which are the subject of the next section. 

3. ALL-DISTANCES SKETCHES 

We preprocess the graph to compute a set of All-Distances Sketches
(ADS) for the nodes in the graph. The sketches are defined
with respect to a parameter k and a random permutation of rankee
nodes. We find it convenient to work with $r(i) \in [0, 1]$, which
is the permutation position of i divided by $|U|$. Alternatively, it
is sometimes convenient to work instead with random hash based
$r(i) \sim U[0, 1]$. The sketch ADS(i) of a node $i \in V$ consists of a
set of entries of the form $(j, d_{ij})$, consisting of a node $j \in U$ and
the distance $d_{ij}$. We assume that r(j) is either included in the entry
or can be easily retrieved from j. The set of rankee nodes included
in ADS(i) is a random variable which depends on the assignment
r:

    $j \in \mathrm{ADS}(i) \iff r(j) < k^{\mathrm{th}}_r\{h \in U \mid d_{ih} < d_{ij}\}$ .    (6)

This ADS definition (6) applies with unique and non-unique
distances. A technical point is that for estimation with non-unique
distances, we also maintain with ADS(i), as auxiliary, entries $(j, d_{ij})$
that satisfy, for some $z \in \mathrm{ADS}(i)$,
$r(j) = k^{\mathrm{th}}_r\{h \in U \setminus \{z\} \mid d_{ih} < d_{iz}\}$,
when these entries are not already included in ADS(i).
(When distances are unique, all these entries are already in
ADS(i).) With unique distances, the expected size of the sketches
is at most about $k \ln |U|$, with good concentration,
but the sketch can be much smaller when distances are not unique,
while providing the same statistical guarantees on estimate quality,
which is why we treat non-unique distances separately rather than
breaking ties.

Our implementation of ADS computation is based on Pruned
Dijkstra’s. The pseudocode for the basic sequential
version is provided as Algorithm 1 and uses $O(km \ln n)$ edge
traversals. When applied with non-unique distances, the algorithm
also includes the auxiliary entries.


Algorithm 1 ADS set for G via Pruned Dijkstra’s

for rankee node $u \in U$ by increasing r(u) do
    Run Dijkstra’s algorithm from u on $G^T$
    foreach scanned node v do
        if $d_{vu} > k^{\mathrm{th}}\{y \mid (x, y) \in \mathrm{ADS}(v)\}$ or
           $|\{x \in \mathrm{ADS}(v) \mid d_{vx} \le d_{vu}\}| > k$ then
            prune Dijkstra at v
        else
            $\mathrm{ADS}(v) \leftarrow \mathrm{ADS}(v) \cup \{(r(u), d_{vu})\}$


The term scanned node in the pseudocode refers to the event
where the node $v \in V$ is popped from the Dijkstra priority queue.
Each node can be scanned at most once in each (pruned) Dijkstra
search. The scanned nodes are always a prefix of the nodes when
sorted by increasing distance from u in $G^T$. When a node v is
scanned, either u is inserted to ADS(v) or the search is pruned at
v. Therefore, the number of node scans is equal to the ADS size.

The algorithm builds the ADS of all nodes by considering one
node $u \in U$ at a time and adding it as an entry in ADS(v) for all
relevant v. To do so efficiently, we maintain the entries in ADS(v)
as an array sorted by decreasing distances. The insertion condition
then amounts to testing if $|\mathrm{ADS}(v)| < k$, or if the entry (x, y) in
the $|\mathrm{ADS}(v)| - k$ position in the array (the kth smallest distance)
has $y > d_{vu}$, or if it has $y = d_{vu}$ but either $|\mathrm{ADS}(v)| = k$ or the
entry (x, z) in the $|\mathrm{ADS}(v)| - k - 1$ position has $z > d_{vu}$.

We refer to the kth smallest distance in ADS(v) as the threshold
distance and denote it by $\Delta(v)$. We also use the notation $*(v)$ for
the bit indicating if the (k+1)th smallest distance is equal to the kth
smallest distance. The prune condition can then be written as

    $d_{vu} > \Delta(v)$ or ($d_{vu} = \Delta(v)$ and $*(v)$) .    (7)

Observe that insertions can only affect the k last entries in the
current ADS. Therefore, it suffices to keep only that “tail” part in
active memory. When k is small we can keep it as an array and
implement insertions by shifting. When k is larger we can use a
data structure that supports efficient insertions.
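
As an illustration of the sequential scheme, here is a compact Python sketch of Pruned Dijkstra's. It is a simplification under stated assumptions, not the authors' implementation: it inserts u into ADS(v) exactly when fewer than k existing entries are strictly closer to v, which follows definition (6) and ignores the auxiliary entries; it keeps whole sketches as unsorted lists instead of a k-tail; and it takes the transpose graph as a hypothetical adjacency dictionary `adj_t`. Entries are stored as (node, distance) pairs.

```python
import heapq

def build_ads(adj_t, r, k):
    """Pruned Dijkstra from every rankee, in increasing order of r."""
    ads = {v: [] for v in adj_t}          # v -> list of (u, d_vu)
    for u in sorted(r, key=r.get):
        dist = {u: 0.0}
        heap = [(0.0, u)]
        scanned = set()
        while heap:
            d, v = heapq.heappop(heap)
            if v in scanned:
                continue
            scanned.add(v)
            # Prune if k earlier (lower-r) entries are strictly closer to v.
            closer = sum(1 for _, dx in ads[v] if dx < d)
            if closer >= k:
                continue                  # do not relax edges out of v
            ads[v].append((u, d))
            for w, length in adj_t[v]:
                nd = d + length
                if nd < dist.get(w, float("inf")):
                    dist[w] = nd
                    heapq.heappush(heap, (nd, w))
    return ads

# Toy undirected path a - b - c with unit lengths (its own transpose).
adj_t = {
    "a": [("b", 1.0)],
    "b": [("a", 1.0), ("c", 1.0)],
    "c": [("b", 1.0)],
}
r = {"a": 0.1, "b": 0.5, "c": 0.9}        # illustrative random ranks
ads = build_ads(adj_t, r, k=1)
```

With k = 1, node a (the lowest r value) appears in every sketch, b is pruned at a, and c is pruned at b, so the sketches shrink toward the low-rank end of the path.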

3.1 Multithreading 

Pruned Dijkstra’s, as stated, sequentially performs possibly
dependent searches from all rankee nodes. We propose here a
design which allows us to control in a principled way the tradeoff
between overhead and concurrency. We partition the $|U|$ pruned
Dijkstra searches into batches, where each batch is a consecutive set
of nodes when ordered by increasing r. All the searches in the same
batch are made independent so that they can be executed
concurrently. Each search computes a set of proposed entries to sketches
ADS(v), as contributions to a set PE(v). A proposed entry is
created when a node v is visited and the pruning condition (7) is not
satisfied. The pruning, however, and hence the proposed entries,
are computed with respect to the set of threshold distances and bits
$(\Delta(v), *(v))$ for $v \in V$, as it was at the beginning of the batch.
Pseudocode for an independent search thread is provided as
Algorithm 2. Each such Dijkstra search may generate a proposed ADS
entry for multiple nodes.

At the end of a batch, for each node v, the proposed entries
PE(v) from all the searches in the batch are merged with (the
k-tail of) ADS(v) (as it was at the beginning of the batch) to compute
an updated ADS(v) with respect to the end of the batch. The merge
is performed by scanning the entries $(u, d_{vu})$ in PE(v) in order of
increasing r and applying the insertion procedure to ADS(v) as
used in Algorithm 1. If the pruning condition (7) does not hold, we
insert u and update ADS(v) (note that this updates $\Delta(v)$ and $*(v)$).
Note that not all proposed entries are incorporated, since the
insertion rule may no longer be satisfied with respect to the updated
$(\Delta(v), *(v))$ after processing previous PE(v) entries.


Algorithm 2 A Pruned Dijkstra thread (search from u)

Run Dijkstra’s algorithm from u on $G^T$
foreach scanned node v do
    if $d_{vu} > \Delta(v)$ or ($d_{vu} = \Delta(v)$ and $*(v)$) then
        prune Dijkstra at v
    else
        $PE(v) \leftarrow PE(v) \cup \{(u, d_{vu})\}$


3.2 Concurrency/Overhead tradeoff analysis 

The sequential algorithm has the property that all generated
entries constitute final ADS entries. The multithreading algorithm
computes proposed entries that may be eventually discarded. These
discarded entries are the overhead of the multithreading algorithm.

More precisely, we define the overhead as the ratio of the
expected number of discarded entries to the expected number of ADS
entries per node. The overhead depends on how we partition the
searches into batches. Placing each search in a separate batch would
result in no overhead, but also no concurrency. Putting all searches
in the same batch would have a very large overhead, as none of the
searches would be pruned.

Note that the overhead of discarded entries corresponds to an
overhead on edge traversals, which are the main cost of the
algorithm. In particular, we can bound the total work performed by
the multithreaded algorithm by multiplying the sequential bound
of $km \ln |U|$ by $(1 + h)$, where h is a bound on the overhead.

We next propose batch partitions which allow us to bound the
overhead. We first observe that the search is never pruned for the
k nodes with lowest r values. Our first batch would contain these
nodes, and we can perform those searches independently without
overhead. At the end of this first batch, all generated proposed
entries PE(i) would be sorted by distance to form ADS(i) with
respect to those k nodes. As for subsequent batches, we propose
exponentially increasing batch sizes and show the following:

Lemma 3.1. For a parameter $\mu > 0$, consider a partition into
batches, in the sorted order by increasing r(v), so that each batch
ends at a position that is a factor of $(1 + \mu)$ larger than the position
at which the previous batch ended. Then the expected overhead
is at most $h \le \mu / \ln(1 + \mu) - 1$.

Proof. Consider processing a batch that starts at position $b_0 + 1$.
The probability of a node in the batch to enter PE(i) is
$\min\{1, k/(b_0 + 1)\}$. Note that to generate a proposed entry, the node
needs to be at distance smaller than $\Delta(i)$, that is, be among the k
smallest distances among all the nodes processed up to the previous batch
and itself. This probability is exactly that of being in one of the first
k positions in a random permutation of $b_0 + 1$ nodes, which is
$\min\{1, k/(b_0 + 1)\}$. Now we can consider the probability that a node in
the batch is a final member of ADS(i). If the node is in position
$b_0 + j$, the probability is $\min\{1, k/(b_0 + j)\}$.

We now consider a batch that has nodes in permutation positions
$b_0 + 1$ to $b_t$, such that $b_0 \ge k$. The ratio of good work to total work
is

    $\frac{\sum_{j=1}^{b_t - b_0} 1/(b_0 + j)}{(b_t - b_0)/(b_0 + 1)}
      = \frac{b_0 + 1}{b_t - b_0} \sum_{j=1}^{b_t - b_0} \frac{1}{b_0 + j}
      \approx \frac{b_0 + 1}{b_t - b_0} \ln \frac{b_t}{b_0 + 1}$ .

Thus, if we choose $b_t = (1 + \mu) b_0$ for some $\mu > 0$, we obtain
(approximately) the ratio $\ln(1 + \mu)/\mu$. The overhead, by definition,
is the inverse of this ratio minus 1. □

In particular, we can see that $\mu = 0.5$ results in an overhead bound of
roughly 23% more edge traversals than the sequential algorithm. Using
$\mu = 0.1$ has an overhead of about 5%. The total number of batches is
$O(\log_{1+\mu}(|U|/k))$, which is logarithmic in the
number of rankee nodes.
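
The bound of Lemma 3.1 and the resulting batch schedule are easy to tabulate. The sketch below is illustrative: `overhead_bound` evaluates $\mu/\ln(1+\mu) - 1$, and `batch_ends` is a hypothetical helper that produces batch end positions by repeatedly multiplying by $(1+\mu)$ and rounding down, starting from a first batch of k nodes. The rounding rule is an assumption made for this example.

```python
import math

def overhead_bound(mu):
    """Upper bound on the expected overhead for growth factor (1 + mu)."""
    return mu / math.log1p(mu) - 1.0

def batch_ends(num_rankees, k, mu):
    """End positions (1-based) of batches with geometric growth."""
    ends = [min(k, num_rankees)]          # first batch: the k lowest-r nodes
    while ends[-1] < num_rankees:
        nxt = max(ends[-1] + 1, math.floor((1.0 + mu) * ends[-1]))
        ends.append(min(nxt, num_rankees))
    return ends

bound_half = overhead_bound(0.5)    # about 0.233
bound_tenth = overhead_bound(0.1)   # about 0.049
schedule = batch_ends(100, 4, 0.5)
```

Smaller values of `mu` lower the overhead bound but produce more batches, and therefore more synchronization points, which is the tradeoff the parameter controls.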

3.3 Cardinality estimation 

We now discuss the computation of a list L(i) from ADS(i).
Recall that L(i) is a list of pairs of the form $(d, \hat{n}_i(d))$. There is one
pair for each unique distance d in ADS(i), and we assume L(i) is
sorted by increasing d.

We review two estimators $\hat{n}_i(d)$: the bottom-k estimator and the
HIP estimator. Both are unbiased, nonnegative, and have a small
relative error, with good concentration that depends on the ADS
parameter k. The HIP estimate is tighter: Its estimates are at least as
good as bottom-k, and with unique distances, it has half the variance
of the bottom-k estimator. The bottom-k estimator, however, is
useful to us because it has a monotonicity property. We can
compute the lists L(i) using either estimator or both.

The bottom-k estimator: This inverse probability estimator [25]
has coefficient of variation (CV) at most $1/\sqrt{k - 2}$. To
estimate $n_i(x)$, we take the kth smallest r value among nodes in
$N_i(x)$, which we denote by $\tau$. If there are fewer than k nodes
in $N_i(x)$, we return the number of entries as our estimate.
Otherwise, we compute the probability p that an r-value is below $\tau$.
When $r(v) \sim U[0, 1]$, $p = \tau$. We then use the estimate
$\hat{n}_i(x) = (k - 1)/p$.
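
A direct transcription of this estimator, under the assumption that $r(v) \sim U[0,1]$ so that $p = \tau$, might look as follows. The representation of ADS(i) as a list of (r value, distance) pairs and the sample entries are choices made for this sketch.

```python
import heapq

def bottom_k_estimate(entries, k, x):
    """Bottom-k estimate of the number of rankees within distance x.

    entries: (r_value, distance) pairs of one sketch ADS(i)."""
    rs = [r for r, d in entries if d <= x]
    if len(rs) < k:
        return float(len(rs))             # fewer than k nodes: exact count
    tau = heapq.nsmallest(k, rs)[-1]      # kth smallest r value
    return (k - 1) / tau                  # p = tau for uniform r values

# Illustrative sketch with k = 3: (r value, distance) pairs.
entries = [(0.05, 0.0), (0.40, 1.0), (0.20, 2.0), (0.10, 3.0)]
est_small = bottom_k_estimate(entries, 3, 1.0)   # only two entries in range
est_mid = bottom_k_estimate(entries, 3, 2.0)     # tau = 0.40
est_large = bottom_k_estimate(entries, 3, 3.0)   # tau = 0.20
```

As the query distance grows, the kth smallest r value can only decrease, so the estimate does not decrease, which is the monotonicity property used later.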

The HIP estimator: This estimator has CV at most $1/\sqrt{k - 2}$ and
with unique distances at most $1/\sqrt{2k - 2}$ (an extension to
non-unique distances also exists). The estimates are obtained as
follows: For each (non-auxiliary) entry j in ADS(i), we compute the
threshold value

    $\tau_{ij} = k^{\mathrm{th}}_r\{h \in \mathrm{ADS}(i) \mid d_{ih} < d_{ij}\}$ .    (8)

We then compute $p_{ij}$ as the probability of $r(j) < \tau_{ij}$. If there are
fewer than k entries with distance lower than $d_{ij}$ then $p_{ij} = 1$.
Otherwise, when $r(j) \sim U[0, 1]$, we have $p_{ij} = \tau_{ij}$. We then take
$a_{ij} = 1/p_{ij}$. Finally, the HIP estimate (summed over non-auxiliary
entries) is

    $\hat{n}_i(x) = \sum_{j \in \mathrm{ADS}(i) \mid d_{ij} \le x} a_{ij}$ .
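
The HIP weights can be accumulated in a single pass over the sketch in increasing distance order. The sketch below assumes unique distances (so every earlier entry is strictly closer), no auxiliary entries, and $r \sim U[0,1]$; the function name and the (r value, distance) representation are choices made for this example.

```python
from bisect import insort

def hip_list(entries, k):
    """Return the list of (distance, HIP estimate) pairs for one sketch.

    entries: (r_value, distance) pairs of ADS(i), distances assumed unique."""
    smallest = []   # the k smallest r values among entries processed so far
    total = 0.0
    out = []
    for r, d in sorted(entries, key=lambda e: e[1]):
        # tau is the kth smallest r among strictly closer entries, if any.
        p = 1.0 if len(smallest) < k else smallest[-1]
        total += 1.0 / p                  # adjusted weight a_ij = 1 / p_ij
        out.append((d, total))
        insort(smallest, r)
        del smallest[k:]                  # keep only the k smallest
    return out

# Illustrative sketch with k = 3: (r value, distance) pairs.
entries = [(0.05, 0.0), (0.40, 1.0), (0.20, 2.0), (0.10, 3.0)]
lst = hip_list(entries, 3)
```

The first k entries each contribute weight 1, and later entries contribute more than 1, reflecting that they stand in for closer nodes that were not sampled into the sketch.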

Computing the estimates: The estimation list L(i) for both the
bottom-k and the HIP estimators can be computed by processing
the entries of ADS(i) in increasing distance order, maintaining the
k smallest r values in the prefix processed so far and accordingly
the kth smallest value $\tau$, and computing the estimates $\hat{n}_i(d)$ when
entries of distance d are processed.

An easy to verify property that is useful to us is that the
neighborhood size estimates for each node are non-decreasing with distance
from the node and therefore can only increase when the distance
does:

Lemma 3.2. When L(i) is computed using either bottom-k or
HIP estimates, the estimates (4) and (5) satisfy

    $d_1 \le d_2 \implies \hat{n}_i(d_1) \le \hat{n}_i(d_2)$

    $d_1 \le d_2 \implies \hat{\underline{n}}_i(d_1) \le \hat{\underline{n}}_i(d_2)$ .

4. REVERSE-RANK SINGLE-SOURCE 

As noted in the introduction, if we are interested in computing
reverse-ranks from a source i to all nodes, we can compute the
distances $d_{ji}$ by applying Dijkstra’s algorithm from i on $G^T$, and
return estimated reverse-ranks from the distances using (4) and (5).
The nodes, however, are processed in order of increasing distance,
which does not necessarily correspond to the order by increasing
reverse ranks (recall the example in the introduction). Therefore, if
we are only interested in correctly identifying nodes with highest
reverse ranks and we apply this algorithm, we cannot prune the
computation and we will scan a much larger portion of the graph
than needed.
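
The following sketch illustrates this baseline and why it lacks sorted access. It is a toy setup under explicit assumptions: exact neighborhood counts (computed by brute force in `exact_list`) stand in for the ADS-based lists L(j), the counts include the ranking node itself as in the definition of $N_i(y)$, and the graph is undirected so it equals its transpose.

```python
import heapq
from bisect import bisect_right

def dijkstra(adj, src):
    dist = {src: 0.0}
    heap = [(0.0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def exact_list(adj, j):
    """Stand-in for L(j): (distance, exact count within that distance)."""
    ds = sorted(dijkstra(adj, j).values())
    return [(d, n + 1) for n, d in enumerate(ds)
            if n + 1 == len(ds) or ds[n + 1] != d]

def lookup(lst, d):
    """Value attached to the largest stored distance <= d, as in (4)."""
    pos = bisect_right(lst, (d, float("inf")))
    return lst[pos - 1][1] if pos else 0

def reverse_ranks_by_distance(adj_t, source, lists):
    """Baseline: full Dijkstra from the source, then one lookup per node."""
    dist = dijkstra(adj_t, source)
    return dist, {j: lookup(lists[j], d) for j, d in dist.items() if j != source}

edges = [("A", "B", 1.5), ("A", "C", 2.0),
         ("B", "X1", 1.0), ("B", "X2", 1.0), ("B", "X3", 1.0)]
adj = {}
for a, b, w in edges:
    adj.setdefault(a, []).append((b, w))
    adj.setdefault(b, []).append((a, w))
lists = {j: exact_list(adj, j) for j in adj}
dist, rev = reverse_ranks_by_distance(adj, "A", lists)
```

Here B is closer to A than C is, yet C assigns A a better rank than B does, so Dijkstra's distance order reveals B before C even though C comes first in reverse-rank order.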

In this section we present an approximate reverse-rank
single-source algorithm that provides sorted access: Computing nodes in
order of increasing (approximate) reverse-rank. We start by
establishing a basic monotonicity property of reverse-ranks that is
essential for the correctness of our design.

4.1 SP monotonicity of reverse-ranks 

We noted in the introduction that reverse-rank order does not
necessarily correspond to distance order. For nodes on a shortest
path, however, we can show that the reverse-ranks, and the
respective bottom-k estimates, are monotone:

Lemma 4.1. Consider a shortest path $i_t, \ldots, i_0$ in G. Then
$\overline{\pi}_{i_j i_0}$, $\underline{\pi}_{i_j i_0}$, and the bottom-k estimates
$\hat{\overline{\pi}}_{i_j i_0}$ and $\hat{\underline{\pi}}_{i_j i_0}$ are all
non-decreasing with j.

Proof. Consider $j < h$; then $d_{i_j i_0} < d_{i_h i_0}$. The neighborhoods satisfy $N_{i_j}(d_{i_j i_0}) \subseteq N_{i_h}(d_{i_h i_0})$. Therefore, $\overline{\pi}_{i_j i_0} \le \overline{\pi}_{i_h i_0}$. Similarly, $\underline{N}_{i_j}(d_{i_j i_0}) \subseteq \underline{N}_{i_h}(d_{i_h i_0})$, and thus the cardinalities satisfy $\underline{\pi}_{i_j i_0} \le \underline{\pi}_{i_h i_0}$.

In the case of bottom-$k$ estimates, the claim follows again from containment of neighborhoods. Let $\tau_1$ be the $k$th smallest $r$ value in the contained set and let $\tau_2$ be the $k$th smallest $r$ value in the containing set. Then clearly $\tau_1 \ge \tau_2$. Recall that the bottom-$k$ cardinality estimate is $(k-1)/\tau$. We have $(k-1)/\tau_1 \le (k-1)/\tau_2$. □

4.2 Algorithm and analysis 

The pseudocode for our reverse-rank single-source algorithm is presented as Algorithm 1. The algorithm has the same structure as Dijkstra's algorithm from source $s$ in the transposed graph $G^T$. The algorithm maintains each unprocessed node $j$ that is adjacent to an already processed node in a min priority queue. The entry contains $(j, \overline{d}_{js})$, where $\overline{d}_{js}$ is the current upper bound on the distance from $j$ to $s$. This upper bound serves as the priority in Dijkstra's algorithm and is the minimum over processed nodes $h$ of $d_{hs} + w_{jh}$. Our reverse-rank single-source algorithm instead uses the following priority. We first look at $\hat{n}_j(\overline{d}_{js})$, which is an upper bound on the estimated reverse-rank, computed according to the best current upper bound on the distance: from Lemma 3.2, $\hat{n}_j(\overline{d}_{js}) \ge \hat{n}_j(d_{js})$. Therefore, when the upper bounds on the distances are tightened, the priority can only decrease. Now, two nodes in the priority queue can have the same estimate $\hat{n}_j(\overline{d}_{js})$. In this case, we break ties according to the distance upper bounds $\overline{d}_{js}$, always preferring the node with lower $\overline{d}_{js}$. If both $\hat{n}_j(\overline{d}_{js})$ and $\overline{d}_{js}$ are the same, the tie can be broken arbitrarily.

The next node $h$ that is selected from the queue is the one with minimum priority according to lexicographic order on $(\hat{n}_j(\overline{d}_{js}), \overline{d}_{js})$. For this node $h$ we set $d_{hs} \leftarrow \overline{d}_{hs}$ (a correctness proof that $d_{hs}$ is indeed the distance is provided below). We then scan all incoming edges $(j, h)$. If $j$ is not already in the priority queue, we insert it with $\overline{d}_{js} = d_{hs} + w(j, h)$ and the respective priority. If $j$ is already in the queue, we compare $x \leftarrow d_{hs} + w(j, h)$ to the current $\overline{d}_{js}$. If $x < \overline{d}_{js}$, we update $\overline{d}_{js} \leftarrow x$ and update the priority to $(\hat{n}_j(x), x)$.

Note that the algorithm applies to both directed and undirected graphs. When applied to directed graphs, the algorithm returns reverse-ranks only for nodes that can reach $s$. For completeness, we explain how to extend this, if needed, to nodes $v$ that cannot reach $s$, that is, $d_{vs} = \infty$. We first need to precisely define the rank $\pi_{vs}$ in this case. All rankee nodes that cannot be reached from $v$ can be viewed as having rank range $[|R_v| + 1, |U|]$, where $R_v$ is the set of rankee nodes reachable from $v$. Now note that we can estimate $|R_v|$ by the cardinality estimate associated with the maximum-distance entry in $L(v)$.


Algorithm 1: Approximate Reverse-rank Single-Source

Input: Source node s
Output: A sequence (v, $d_{vs}$, $\hat{\pi}_{vs}$) in increasing $\hat{\pi}_{vs}$

// Node object v: v.dist and v.rDr are upper bounds on $d_{vs}$ and $\hat{\pi}_{vs}$.
// Initialization v.Init(d, r): v.dist ← d; v.rDr ← r

Q ← ∅                 // Initialize an empty min priority queue Q of node objects, prioritized by lex order of (v.rDr, v.dist)
s.Init(0, 1)          // Initialize source node object
Q.add(s)              // Put source node in queue
while Q is not empty do
    v ← Q.extract_min()
    Output(v)         // Output and scan v
    foreach u | (u, v) ∈ E and u not scanned do
        d ← v.dist + $w_{uv}$
        if u ∉ Q then
            u.Init(d, $\hat{n}_u(d)$)
            Q.add(u)
        else
            if d < u.dist then
                Q.decrease_key(u, ($\hat{n}_u(d)$, d))   // Update priority of u in Q: u.dist ← d; u.rDr ← $\hat{n}_u(d)$
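The pseudocode can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the graph is given as a dictionary `in_edges` of incoming edges, each estimation list `L[u]` is a sorted list of (distance, estimate) pairs, node labels are mutually comparable (they are used only to break exact ties in the heap), and `decrease_key` is emulated by lazy deletion. All names are hypothetical.

```python
import heapq
from bisect import bisect_right

def n_hat(est_list, d):
    """Estimate at distance d: value of the last entry with distance <= d."""
    idx = bisect_right(est_list, (d, float("inf"))) - 1
    return est_list[idx][1] if idx >= 0 else 0.0

def reverse_rank_single_source(in_edges, L, s):
    """Yield (v, distance to s, estimated reverse-rank) in lex order of (rank, distance).

    in_edges[v]: list of (u, w) for every edge u -> v of length w > 0.
    L[u]: estimation list of u, sorted [(distance, estimate)].
    """
    best = {s: (1.0, 0.0)}               # node -> (rank bound, distance bound)
    heap = [(1.0, 0.0, s)]
    scanned = set()
    while heap:
        r, d, v = heapq.heappop(heap)
        if v in scanned or best[v] != (r, d):
            continue                      # stale entry left behind by an update
        scanned.add(v)
        yield v, d, r
        for u, w in in_edges.get(v, ()):
            if u in scanned:
                continue
            du = d + w
            if u not in best or du < best[u][1]:
                best[u] = (n_hat(L[u], du), du)
                heapq.heappush(heap, (best[u][0], du, u))
```

Since the function is a generator, a caller that only needs the nodes with the smallest estimated reverse-ranks can stop iterating early, and edges adjacent to nodes that were never returned are not traversed.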


For correctness, we need to show that when a node $v$ is popped from the priority queue, we have the correct distance $d_{vs}$ and thus can obtain the bottom-$k$ estimate of $\pi_{vs}$. This holds if all nodes that are on the shortest path from $v$ to $s$ were scanned before $v$.

Theorem 4.1. When Algorithm 1 is applied with exact cardinalities $n_i(d)$ or with bottom-$k$ estimates, it traverses a shortest-paths tree from $s$.

Proof. Consider a source $s = y_0$ and let $y_0, y_1, \ldots$ be the nodes sorted by increasing $\hat{\pi}_{y_i s}$, with ties broken according to $d_{y_i s}$. We show that a node can be scanned only after all the other nodes on its shortest path to $s$ are scanned. This means that when it is scanned, its current priority is computed according to its true distance to $s$ and therefore uses the bottom-$k$ reverse-rank estimate.

We show correctness by induction on $t$. Assume that the scanned nodes are $y_0, \ldots, y_t$ and that for these nodes we have exact SP distances and thus the estimates $\hat{\pi}_{y_i s}$. Consider now $y_{t+1}$ and the shortest path $P$ from $y_{t+1}$ to $s$. It follows from Lemma 4.1 that the reverse-rank estimates are monotone non-decreasing along the path. Also note that the distances to $s$ of the other nodes on $P$ are strictly smaller. Therefore, all the nodes of path $P$, except $y_{t+1}$, are in $\{y_0, \ldots, y_t\}$ and therefore, by induction, already scanned. □


5. REVERSE-RANK INFLUENCE 

In this section, we consider the computation and maximization of reverse-rank influence. Consider a graph with a set of rankee nodes $U \subseteq V$ and ranks $\pi_{ji}$ defined for rankees $i \in U$ and $j \in V$. Let $\beta(j) \ge 0$ be the ranker weight of $j \in V$. For a set $S \subseteq U$ of seed nodes, the reverse-rank influence is

$\mathrm{Inf}(S) = \sum_{j \in Z} \beta(j)\,\alpha(\pi_{jS})$,

where $Z \subseteq V$ is the set of ranker nodes (those with $\beta(z) > 0$). From Corollary 6.2, the exact computation of $\mathrm{Inf}(S)$ has subcubic equivalence to APSP, even when restricted to threshold functions $\alpha$ and a single seed.

We therefore focus on scalably computing the approximate influence

$\widehat{\mathrm{Inf}}(S) = \sum_{j \in Z} \beta(j)\,\alpha(\hat{\pi}_{jS})$,

where $\hat{\pi}_{jS} = \min_{i \in S} \hat{\pi}_{ji}$.

Note that to compute $\widehat{\mathrm{Inf}}(S)$ it suffices to compute $\hat{\pi}_{jS}$ for all ranker nodes $j \in Z$. Moreover, when $\alpha$ is a threshold function for some $T \ll n$, or more generally, any function with $\alpha(x) = 0$ for all $x > T$, it suffices to compute $\hat{\pi}_{jS}$ only for nodes with $\hat{\pi}_{jS} \le T$. A naive way to compute these values is to perform, from each seed $i \in S$, a single-source reverse-rank search using Algorithm 1, and terminate the search when we scan a node with $\hat{\pi}_{ji} > T$. We can then combine the results of the different searches, computing the minimum $\hat{\pi}_{ji}$ of each node $j$ that is scanned in at least one of the searches, to obtain the values $\hat{\pi}_{jS}$. This naive computation requires $|S||E| \log n$ operations (assuming the lists $L(j)$ are provided) when $T$ is large. But even with smaller $T$, a node $j$ can be scanned multiple times, once for each seed $i \in S$ with $\hat{\pi}_{ji} \le T$. We now show how to remove the dependence on the number of seeds $|S|$.


Theorem 5.1. For a set of seeds $S$, we can compute the values $\hat{\pi}_{jS}$ for all $j \in V$ using $O(|S| + |E| \log n)$ operations. When $\alpha(x) = 0$ for $x > T$, $|E|$ is replaced by the number of incoming edges to nodes $j$ that satisfy $\hat{\pi}_{jS} \le T$. These bounds assume that the lists $L(j)$ are provided for $j \in V$.

Proof. We slightly modify Algorithm 1 by initializing the priority queue with entries with priorities (i.dist, i.rDr) = (0, 1) for each node $i \in S$. The algorithm execution then proceeds as with a single source node. For correctness, we can show that nodes are scanned (popped from the queue) in increasing lex order of $(\hat{n}_j(d_{jS}), d_{jS})$, and at the point they are scanned we have j.dist $= d_{jS}$ and thus j.rDr $= \hat{n}_j(d_{jS})$.

To see that, first note that $d_{jS}$ suffices to obtain $\hat{\pi}_{jS}$. This is because, using Lemma 3.2, $\hat{n}_j(d_{jS}) = \min_{i \in S} \hat{n}_j(d_{ji}) = \hat{n}_j(\min_{i \in S} d_{ji})$.

For correctness, we need to extend the monotonicity property (Lemma 4.1) to sets $S$. Consider a shortest path $i_t, \ldots, i_0$ from $i_t$ to $S$ (that is, $i_0$ is the closest $S$ node to $i_t$). Note that this implies that $i_0$ is the closest $S$ node to all $i_j$. It follows from Lemma 4.1 that $\overline{\pi}_{i_j i_0}$, $\underline{\pi}_{i_j i_0}$, and the bottom-$k$ estimates $\hat{\overline{\pi}}_{i_j i_0}$ and $\hat{\underline{\pi}}_{i_j i_0}$ are non-decreasing with $j$. We observe that $\overline{\pi}_{i_j i_0} = \overline{\pi}_{i_j S}$ and similarly $\underline{\pi}_{i_j i_0} = \underline{\pi}_{i_j S}$, and this also holds for the bottom-$k$ estimates obtained using $d_{i_j S}$. Therefore, monotonicity holds also when $i_0$ is substituted with $S$. □
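The modification used in this proof can be sketched as a multi-seed variant of the single-source search, in which every seed starts in the queue with priority (rank, distance) = (1, 0). The Python below is an illustrative sketch only: the input layout (`in_edges`, sorted estimation lists `L[u]`), the function names, and the early exit once the popped rank exceeds `T` are assumptions made for this example.

```python
import heapq
from bisect import bisect_right

def n_hat(est_list, d):
    """Estimate at distance d: value of the last entry with distance <= d."""
    idx = bisect_right(est_list, (d, float("inf"))) - 1
    return est_list[idx][1] if idx >= 0 else 0.0

def seed_set_reverse_ranks(in_edges, L, seeds, T=float("inf")):
    """Map each node j with estimated rank <= T to (rank of S at j, distance to S)."""
    best = {i: (1, 0) for i in seeds}     # node -> (rank bound, distance bound)
    heap = [(1, 0, i) for i in seeds]
    heapq.heapify(heap)
    done = {}
    while heap:
        r, d, v = heapq.heappop(heap)
        if v in done or best[v] != (r, d):
            continue                       # stale entry
        if r > T:
            break                          # nodes are popped in increasing rank order
        done[v] = (r, d)
        for u, w in in_edges.get(v, ()):
            du = d + w
            if u not in done and (u not in best or du < best[u][1]):
                best[u] = (n_hat(L[u], du), du)
                heapq.heappush(heap, (best[u][0], du, u))
    return done
```

Each node is scanned at most once regardless of the number of seeds, which is the point of the theorem.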


5.1 Influence Maximization 

Here we consider uniform ranker weights $\beta(z) = 1$ and $\alpha$ that is a threshold function for some $T$. The influence of a set $S \subseteq U$ of rankee nodes is then the number of rankers that have at least one node from $S$ among their top $T$ rankees:

$\mathrm{Inf}(S) = |\{z \in Z \mid \pi_{zS} \le T\}|$.   (9)

The goal of the IM problem is to find a set $S$ of rankee nodes of a certain size that maximizes $\mathrm{Inf}(S)$. A common approach to such coverage problems is the greedy algorithm. Greedy repeatedly selects a rankee node that has maximum marginal influence. For each $s \ge 1$, the set $S$ of the first $s$ selected seeds is guaranteed to have influence that is at least $1 - (1 - 1/s)^s$ of the maximum possible by $s$ seeds. Algorithm 2 is an exact greedy algorithm for our reverse-rank IM problem with influence function (9). The computation of the algorithm is dominated by Dijkstra computations from each ranker node, which are stopped once $T$ rankees are popped from the priority queue. Recall that when distances are not unique, we can work with multiple definitions of the rank (as discussed when ranks were defined), but with all of them we can determine the ranks once at most $T + 1$ rankees are popped. For example, when we use $\overline{\pi}$, if the $(T+1)$st rankee has the same distance $d$ as the $T$th rankee, then all rankees of distance $d$ are excluded (they have rank larger than $T$). Even when all nodes are rankees, and thus at most $T$ nodes are popped in total in each Dijkstra run, the required computation is of order $T|E| \log |V|$, which does not scale well for large values of the threshold $T$.


Algorithm 2: Exact greedy reverse-rank IM

Input: Directed graph G = (V, E), ranker nodes Z ⊆ V, rankee nodes U ⊆ V, threshold T
Output: Exact greedy sequence seedlist

seedlist ← ⊥          // Output list of (seed, marginal influence)
forall the rankee nodes u ∈ U do coverage[u] ← ∅
forall the ranker nodes z ∈ Z do
    coverers[z] ← ∅
    Run Dijkstra from z in G, until (we can determine that) $\pi_{zu} > T$
    foreach rankee u ∈ U with $\pi_{zu} \le T$ do
        coverers[z] ← coverers[z] ∪ {u}
        coverage[u] ← coverage[u] ∪ {z}
while there are rankees u with |coverage[u]| > 0 do
    u ← argmax over u ∈ U \ S of |coverage[u]|
    Append (u, |coverage[u]|) to seedlist
    foreach z ∈ coverage[u] do
        foreach v ∈ coverers[z] do
            Remove z from coverage[v]
    Delete coverage[u]
return seedlist
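A compact Python rendering of the exact greedy procedure is given below as a sketch. It makes several assumptions that are not fixed by the pseudocode: the graph is a dictionary `out_edges` of outgoing (neighbor, length) lists, a ranker does not rank itself, ties in distance are broken upwards (so all rankees tied with the $(T+1)$st are excluded), node labels are sortable, and ties between candidate seeds go to the smallest label. Function names are hypothetical.

```python
import heapq

def top_T_rankees(out_edges, z, rankees, T):
    """Rankees u != z whose rank from z is at most T (ties broken upwards)."""
    dist = {z: 0}
    heap = [(0, z)]
    settled = set()
    found = []                             # (distance, rankee), increasing distance
    while heap:
        d, v = heapq.heappop(heap)
        if v in settled:
            continue
        settled.add(v)
        if v != z and v in rankees:
            found.append((d, v))
            if len(found) > T:
                break                      # T + 1 rankees popped: ranks are determined
        for u, w in out_edges.get(v, ()):
            if u not in settled and d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    if len(found) > T:
        cutoff = found[T][0]               # distance of the (T+1)-st rankee
        found = [(d, u) for d, u in found if d < cutoff]
    return [u for _, u in found]

def exact_greedy_im(out_edges, rankers, rankees, T):
    """Return the greedy sequence [(seed, marginal influence), ...]."""
    rankees = set(rankees)
    coverage = {u: set() for u in sorted(rankees)}
    coverers = {}
    for z in rankers:
        coverers[z] = top_T_rankees(out_edges, z, rankees, T)
        for u in coverers[z]:
            coverage[u].add(z)
    seedlist = []
    while any(coverage.values()):
        u = max(coverage, key=lambda x: len(coverage[x]))
        seedlist.append((u, len(coverage[u])))
        for z in list(coverage[u]):        # copy: coverage[u] shrinks in the loop
            for v in coverers[z]:
                coverage[v].discard(z)
        del coverage[u]
    return seedlist
```

For example, on the undirected unit-length graph with edges 0-1, 0-2, 0-3, 3-4 and $T = 1$, nodes 1 and 2 have node 0 as their unique nearest neighbor and node 4 has node 3, so the greedy sequence is node 0 with marginal influence 2 followed by node 3 with marginal influence 1.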


5.2 Approximate Greedy IM 

We next obtain a near-linear algorithm using two relaxations. First, the greedy selection, and thus the statistical guarantees we obtain, are with respect to the relaxed influence function where $\hat{\pi}$ replaces $\pi$:

$\widehat{\mathrm{Inf}}(S) = |\{z \in Z \mid \hat{\pi}_{zS} \le T\}|$.   (10)

Second, we do not compute an exact greedy sequence for $\widehat{\mathrm{Inf}}(S)$ but instead use an approximate greedy algorithm: at each step, it selects a node with marginal influence that is approximately (within a small relative error) the maximum.

Our design adapts the influence maximization algorithms SKIM and T-SKIM [10, 11], which are designed for reachability-based [26] and distance-based [22, 15, 11] influence with threshold functions. We quickly review SKIM, which, remarkably, when all nodes are both rankers and rankees, computes a full approximate greedy permutation in near-linear time. To do so efficiently, SKIM samples nodes not covered by previously selected seeds, and maintains for each candidate seed node the number of sampled nodes it covers. Reachability-based SKIM performs a pruned reverse graph search from each sampled node to determine the nodes that cover it. The distance-based SKIM performs backward pruned Dijkstra searches. The node that first reaches some number $K$ of samples has approximately maximum marginal influence and is selected as a seed. The sample-size parameter $K$ determines a tradeoff between computation and accuracy. SKIM then updates the samples so that they are with respect to the updated marginal influences, with the coverage of the new seed node removed. SKIM also updates the representation of the residual problem. The updates are performed using a respective forward (Dijkstra) search from the new seed to reveal all nodes that it covers. When a previously sampled node becomes covered, the samples of the nodes covering it are adjusted to reflect their reduced marginal coverage. Sampling is then resumed until another node reaches a sample size of $K$. We repeat the process of sampling, selecting a seed, and updating the residual problem until a desired number of seeds is selected, a desired coverage is achieved, or all nodes are covered.

Our Algorithm 3, reverse-rank SKIM (rR-SKIM), follows the SKIM design of iterating the selection of a new seed node (rankee) via sample building and updates. The reverse-rank problem, however, requires some critical adaptations.

When sample building, we repeatedly select random uncovered ranker nodes $z$. We then run Dijkstra's algorithm from $z$ but stop the search when the approximate rank $\hat{\pi}_{zu}$ exceeds $T$. For each visited rankee node $u$, we increment the sample size sample_size[u] and also add $u$ to a list inverted_sample[z] (the list of nodes whose sample includes $z$). This process stops when the first rankee $u$ reaches sample_size[u] $= K$. The node $u$ then becomes the next seed node. We then apply our sorted-access reverse-rank single-source computation from $u$, up to rank $T$, to determine the coverage of the new seed $u$. We mark all uncovered visited nodes as covered. For each newly covered ranker $z$, we scan inverted_sample[z] and decrement sample_size[v] for each $v \in$ inverted_sample[z]. We then delete inverted_sample[z]. For each covered ranker $z$, we maintain the (approximate) rank of the best seed, best_seed[z].rank $= \min_{v \in S} \hat{\pi}_{zv}$, and the corresponding minimum distance, best_seed[z].dist $= \min_{v \in S} d_{zv}$ (note that the node with minimum distance must have minimum estimated rank). The purpose of maintaining best_seed is to enable pruning of reverse searches. Pruning is critical for the near-linear computation bound of the algorithm (without it, we can construct examples where the bulk of covered nodes is revisited with each new seed, resulting in $\Omega(|\mathrm{seedlist}| \cdot m)$ computation).

A search from the new seed $u$ is always pruned at $z$ when $\hat{\pi}_{zu} > T$, but is also pruned when

$\hat{\pi}_{zu} >$ best_seed[z].rank, or   (11)
$\hat{\pi}_{zu} =$ best_seed[z].rank and $d_{zu} \ge$ best_seed[z].dist.
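The pruning test can be written as a small predicate. The sketch below is illustrative: the `BestSeed` record and the function name are hypothetical, uncovered rankers are represented by infinite rank and distance, and the distance comparison is read as non-strict, which is one plausible reading of condition (11).

```python
import math
from dataclasses import dataclass

@dataclass
class BestSeed:
    rank: float = math.inf   # estimated rank of the best seed seen so far
    dist: float = math.inf   # distance to that seed

def should_prune(rank_zu, dist_zu, best, T):
    """True if the search from the new seed u should be pruned at ranker z."""
    if rank_zu > T:
        return True                       # u is outside the top T of z
    if rank_zu > best.rank:
        return True                       # an earlier seed ranks strictly better
    # Equal rank: prune unless the new seed is strictly closer.
    return rank_zu == best.rank and dist_zu >= best.dist
```

With this test, a ranker's record is overwritten only when the new seed improves the (rank, distance) pair lexicographically.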

We now need to show that, also with this pruning, the algorithm maintains the following invariant:

Lemma 5.1. After the processing of a new seed node, all nodes $z$ with $\min_{u \in S} \hat{\pi}_{zu} \le T$ have best_seed[z].rank $= \min_{v \in S} \hat{\pi}_{zv}$ and best_seed[z].dist $= \min_{v \in S} d_{zv}$, and all other nodes have best_seed[z].rank $= +\infty$.

Proof. This property clearly holds when pruning only when $\hat{\pi}_{zu} > T$, since after inserting a new seed node $u$ our reverse-rank search from $u$ visits all nodes $z$ with $\hat{\pi}_{zu} \le T$.

We establish the claim using induction on added seeds. The base of the induction is when $S$ is empty and best_seed[z].rank $= +\infty$. Assume now that our invariant holds and let $s_2$ be a newly selected seed node. Let $u_1$ be a node on which we prune the search from $s_2$. From condition (11) there exists a seed node $s_1$ such that $\hat{\pi}_{u_1 s_1} < \hat{\pi}_{u_1 s_2}$, or $\hat{\pi}_{u_1 s_1} = \hat{\pi}_{u_1 s_2}$ and $d_{u_1 s_1} \le d_{u_1 s_2}$. From the definition of our estimators, $\hat{\pi}_{u_1 s_1} < \hat{\pi}_{u_1 s_2}$ implies $d_{u_1 s_1} < d_{u_1 s_2}$. Combining, we obtain that $d_{u_1 s_1} \le d_{u_1 s_2}$.

Now assume to the contrary that there is a node $u_2$ such that $u_1$ is on the shortest path from $u_2$ to $s_2$ and $\hat{\pi}_{u_2 s_2} < \hat{\pi}_{u_2 s_1}$, or $\hat{\pi}_{u_2 s_2} = \hat{\pi}_{u_2 s_1}$ and $d_{u_2 s_2} < d_{u_2 s_1}$. We show that this is not possible.

Using the above and the triangle inequality we obtain $d_{u_2 s_1} \le d_{u_2 u_1} + d_{u_1 s_1} \le d_{u_2 u_1} + d_{u_1 s_2} = d_{u_2 s_2}$. A property of our estimates is that for any three nodes, $d_{u_2 s_1} \le d_{u_2 s_2}$ implies $\hat{\pi}_{u_2 s_1} \le \hat{\pi}_{u_2 s_2}$. Hence neither alternative can hold. □

The analysis of computation and approximation quality uses components from the analysis of T-SKIM [11]. A critical component in the analysis is that we can “charge” edge traversals used for sample building to increases in sample sizes. When there are many non-rankee nodes, we can construct worst-case graphs where non-rankees are repeatedly traversed without incrementing sample counts. In realistic models, however, and when all nodes are rankers or rankees, we would expect such popular ranker hub nodes to be covered quickly by the first few selected seeds. Another component of the analysis that carries over from T-SKIM is bounding the number of updates to best_seed[z]. The argument there critically relies on the sample-based approximate greedy selection. The approximation quality of the algorithm can only be guaranteed probabilistically and with respect to the approximate ranks $\hat{\pi}_{zu}$. To summarize, when we run the algorithm with $K = O(\epsilon^{-2} \log n)$, and prune sampling searches using the approximate ranks $\hat{\pi}$, we obtain the following.

Theorem 5.2. With very high probability, for all $s \ge 1$, the influence $\widehat{\mathrm{Inf}}$ of the first $s$ seed nodes is at least $1 - (1 - 1/s)^s - \epsilon$ times the maximum possible $\widehat{\mathrm{Inf}}$ with $s$ seeds. When all nodes are
both rankers and rankees, the algorithm uses 0{\E\e~^ log^ n -1- 
\E\e~‘^ logn) operations. 

5.3 Approximability of the exact problem 

Distance-based influence maximization is known to be at least as hard as max cover, also in terms of inapproximability, by a seminal result of Feige. Thus, we know that in a sense Greedy is the best scalable algorithm. What we can say about reverse-rank influence maximization with a threshold kernel $T$ is that it is at least as hard as max cover when each element can be a member of at most $T$ sets. The problem is NP-hard for $T \ge 2$ (by reduction from max vertex cover), but Feige's inapproximability result does not apply. This leaves open the possibility that some polynomial-time algorithms have a better approximation ratio than Greedy.

When $T = 1$, the influence function is simply the number of reverse nearest neighbors. In this case, the coverage sets of different nodes are disjoint and influence maximization is trivial: the greedy permutation, which selects nodes in decreasing order of the number of reverse nearest neighbors, is optimal.

When $T = 2$, each node can be covered by at most two other nodes, which is similar to max vertex cover. That problem is also NP-hard, but has a polynomial-time approximation algorithm that achieves a slightly better approximation ratio than the greedy guarantee of $1 - (1 - 1/s)^s$ [18]. The linear-programming-based algorithm, however, does not scale to large inputs and also does not seem to apply to our general case of $T > 2$.


Algorithm 3: reverse-rank SKIM

Input: Directed graph G = (V, E), ranker nodes Z ⊆ V, rankee nodes U ⊆ V, threshold T, parameter K
Output: Approximate greedy sequence seedlist

// Initialization
forall the nodes u ∈ V do best_seed[u].rank ← ∞
forall the rankee nodes v ∈ U do sample_size[v] ← 0
inverted_sample ← ⊥   // Hash map of ranker nodes to sets of rankee nodes
coverage ← 0          // Coverage of current seedlist
seedlist ← ⊥          // Output list
F ← random shuffle of the ranker nodes Z

while coverage < |Z| and (∃ unscanned u ∈ F or max over u ∈ U of sample_size[u] > 0) do   // Select seed
    x ← ⊥             // Next seed node
    while ∃ unscanned u ∈ F do   // Build samples
        u ← next node in shuffled sequence F
        if best_seed[u].rank < ∞ then   // Node u is covered, skip it
            continue
        // Find all rankees v with $\hat{\pi}_{uv} \le T$
        Run a Dijkstra search from u in G, during which
        foreach scanned rankee node v ∈ U do
            if $\hat{\pi}_{uv} > T$ then terminate the search
            sample_size[v] ← sample_size[v] + 1
            inverted_sample[u] ← inverted_sample[u] ∪ {v}
            if sample_size[v] = K then
                x ← v     // Next seed node
                abort sample building loop
    if x = ⊥ then
        x ← argmax over u ∈ U of sample_size[u]
        if sample_size[x] = 0 then abort main loop
    $I_x$ ← 0         // Estimated coverage of x
    // Compute $I_x$ and update residual
    Run pruned reverse-rank single-source search from x in transposed graph $G^T$, during which
    foreach scanned node v ∈ V with ($\hat{\pi}_{vx}$, $d_{vx}$) do
        if $\hat{\pi}_{vx} > T$ or best_seed[v].rank < $\hat{\pi}_{vx}$ or (best_seed[v].rank = $\hat{\pi}_{vx}$ and best_seed[v].dist ≤ $d_{vx}$) then
            prune at v
        if best_seed[v].rank = ∞ and v ∈ Z then   // v is a newly covered ranker
            $I_x$ ← $I_x$ + 1; coverage ← coverage + 1
            forall the nodes w in inverted_sample[v] do
                sample_size[w] ← sample_size[w] − 1
            inverted_sample[v] ← ⊥   // Delete
        best_seed[v].rank ← $\hat{\pi}_{vx}$
        best_seed[v].dist ← $d_{vx}$
    seedlist.append(x, $I_x$)
return seedlist


6. HARDNESS OF EXACT REVERSE-RANK
SINGLE SOURCE COMPUTATION

Exact single-source reverse-rank computation from a node $u$ returns $\pi_{iu}$ for all nodes $i$. Clearly, it can be solved using an APSP computation. We show the following.

Theorem 6.1. The reverse-rank single-source problem has subcubic equivalence to APSP.

We give a reduction from the Graph Radius Problem. The radius of a graph $G$ is defined as the minimum over nodes $u$ of the maximum distance from $u$ to another node:

$R = \min_{u \in V} \max_{v \in V} d_{uv}$.   (12)

The graph radius problem on undirected graphs is known to have subcubic equivalence to APSP [1].

Given a graph $G = (V, E)$ and a length parameter $x$, we construct a new graph $G_x = (V', E')$ by adding a new node $u$, so that $V' = V \cup \{u\}$, and adding edges from $u$ to all $v \in V$ with length $x$.

Lemma 6.1. Let $G$ be a graph with radius $R$ and consider $x > 0$. If $R > x$ then in $G_x$, for all nodes $v \in V$, $\pi_{vu} < |V|$. If $R \le x$ then there must exist a node $z$ such that $\pi_{zu} = |V|$ in $G_x$.

Proof. Suppose that $R > x$. Then by the definition of radius, for all nodes $z$, $\max_{v \in V} d_{zv} \ge R > x$. Therefore node $u$ will not be the farthest from $z$ and we have $\pi_{zu} < |V|$.

Suppose now that $R \le x$. Let $z \in V$ be such that $R = \max_{v \in V} d_{zv}$. Then for all $v \in V$, $d_{zu} = x \ge R \ge d_{zv}$ and thus $\pi_{zu} = |V|$. □

From the lemma, we can compute the graph radius $R$ by performing a logarithmic number (in the representation of $G$) of exact reverse-rank single-source computations on graphs the size of $G$. This concludes the proof of Theorem 6.1.
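The reduction can be illustrated with a small Python sketch that recovers the radius by binary search over $x$. It rests on several assumptions made only for this example: the graph is undirected and connected with positive integer edge lengths, nodes are labeled 0..n-1, `max_len` is an upper bound on the radius, the rank of $u$ at $z$ counts the nodes other than $z$ whose distance from $z$ is at most $d_{zu}$ (ties broken upwards), and a brute-force routine stands in for the exact reverse-rank single-source oracle.

```python
import heapq

def dijkstra(adj, src):
    """Single-source distances; adj[v] is a list of (neighbor, length)."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue                      # stale queue entry
        for u, w in adj[v]:
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

def exact_reverse_ranks(adj, u):
    """Brute-force stand-in for an exact oracle: rank of u at every other node z."""
    ranks = {}
    for z in adj:
        if z != u:
            dist = dijkstra(adj, z)
            ranks[z] = sum(1 for v, d in dist.items() if v != z and d <= dist[u])
    return ranks

def radius_via_reverse_ranks(adj, max_len):
    """Binary search for the smallest x at which some node ranks the new node last."""
    n = len(adj)
    hub = n                               # new node u; nodes are assumed to be 0..n-1
    lo, hi = 1, max_len
    while lo < hi:
        x = (lo + hi) // 2
        gx = {v: nbrs + [(hub, x)] for v, nbrs in adj.items()}
        gx[hub] = [(v, x) for v in adj]
        if max(exact_reverse_ranks(gx, hub).values()) == n:
            hi = x                        # some node ranks the hub last, so R <= x
        else:
            lo = x + 1                    # no node ranks the hub last, so R > x
    return lo
```

Each iteration issues one single-source reverse-rank query on a graph with one extra node, so the number of queries is logarithmic in `max_len`.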

Corollary 6.2. Exact computation of reverse-rank influence, even with a single seed node ($|S| = 1$), uniform $\beta$, and a threshold function $\alpha$, is subcubic equivalent to APSP.

Proof. We use the same construction and compute the influence (centrality) of node $u$ with $\alpha$ being a threshold function with $T = |V| - 1$. □

7. EXPERIMENTAL EVALUATION 

We implemented and evaluated our algorithms for computing ADS, approximate reverse-rank single-source, and influence maximization. Our implementations are in C++ and were compiled using gcc (g++) with full optimization. Our testing machine runs CentOS 6.5 on a Dell PowerEdge R720 server with two Intel Xeon E5-2640 CPUs, each with 12 cores (2.50GHz, 12x32KiB L1, 6x256KiB L2, and 15MiB L3), and 264GiB of RAM. The disk capacity is 1T.

Table 1 shows the social graphs used for our evaluation, all taken from the SNAP project [25]. For each graph we list the number of nodes and edges and whether edges are directed. These data sets did not distinguish between edges, so we used uniform lengths of 1. Our implementations, however, are designed to work with general positive edge lengths.

7.1 Sequential ADS computation 

Table 1 also lists, for each instance, performance figures (time and memory usage) of our optimized sequential implementation of Pruned Dijkstras. We list performance for ADS parameter values $k = 16, 64, 128$ (higher $k$ implies larger sketch size and processing time, and higher estimation quality). The listed times are broken into load time (loading the graph into memory data structures), ADS time (computing the sketches), and ests time (processing the ADS sketches to compute the distance-to-cardinality estimation lists $L(i)$). We can see that ADS computation was the dominant component. Overall, the preprocessing time is on the order of a few hours, even on our largest data set. The table also lists the virtual memory usage of the different runs. For reference, we provide in a separate table the running time of computing the $T$ nearest neighbors for all nodes in the graph for $T = 16, 64, 128$. We can see that our ADS computation times are comparable with this simpler operation.

7.2 Multithreaded ADS algorithm 

We next evaluate our implementation of the multithreaded ADS algorithm (Section 3.1). The evaluation was done by generating 1 to 14 concurrent threads. We used batch size parameter $\mu = 0.1$. The value $\mu = 0.1$ was selected since it had the best performance in a test sweeping $\mu$ between 0.05 and 1 and considering 1 to 14 threads on the Slashdot graph. We note that the amount of concurrency provided in the algorithmic design is much larger, but the architecture of our machine, mainly the number of cores and shared caches, limited the benefit of using more threads. We show the results for executions with ADS parameter $k = 16$. The time to load the graph into memory and the total virtual memory used did not vary much for the same instance and different numbers of threads. Table 2 lists the load time and virtual memory numbers for 7 threads. The table also shows the run time on a single thread. Note that it can be slightly larger than that of our optimized sequential implementation. Figure 2 shows the running times, as a fraction of the running time on a single thread, as a function of the number of concurrent threads. We note that the number of threads listed is the concurrency generated by our program scheduler; the actual number of cores allocated by the OS was sometimes smaller and we had no control over it. We observe significant benefit of the multithreaded design, in particular for the larger graphs, where we obtain up to a factor of 3 speedup, also with respect to the optimized sequential implementation. We note that most of the speedup is obtained using only 2 to 6 threads.


Table 2: Multithreaded ADS computation

instance      load [sec]   memory [GiB]   1-thread [sec]
Facebook      0.13         0.56           0.69
Slashdot      0.57         1.1            10.0
Twitter       23           2.2            120
YouTube       3.9          3.3            157
LiveJournal   36           11             1541


7.3 Reverse-rank single-source computation 

Table 3 shows running times of our approximate reverse-rank
single-source computations, averaged over 1000 source nodes selected
uniformly at random. The times listed are net per computation,
after loading the graph and pre-computed sketches $L(i)$ into
memory. We show running times for the different ADS parameter
values $k = 16, 64, 128$. For reference, we also show running time
for Dijkstra’s algorithm (single-source distances) averaged over the 
same 1000 source nodes. We can observe that the running times of 
our reverse-rank single-source computation do not depend on the 
sketch parameter k and are similar to Dijkstra computations. The 
table also shows extrapolated running time for APSP computation. 
The extrapolation was obtained by multiplying the time for a single 
Dijkstra run by the number of nodes. This is listed for reference, 
since exact reverse-rank single-source computation is equivalent to 
an all-pairs computation.

Table 1: Test instances and preprocessing time (single thread)

                                              | Preprocess (k = 16) | Preprocess (k = 64) | Preprocess (k = 128)
instance         #nodes     #edges      load  | ADS    mem   ests   | ADS    mem   ests   | ADS    mem   ests
(un/directed)                           [sec] | [sec]  [GB]  [sec]  | [sec]  [GB]  [sec]  | [sec]  [GB]  [sec]
Facebook (u)     3,959      84,243      0.3   | 0.5    0.03  0.043  | 0.7    0.03  0.140  | 1.4    0.04  0.341
Slashdot (d)     77,360     905,468     0.5   | 5.7    0.1   0.276  | 23     0.22  1.91   | 42     0.4   5.35
Twitter (d)      456,626    14,855,842  22    | 88     0.9   2.00   | 338    1.4   13.5   | 523    2     39
YouTube (u)      1,134,890  2,987,624   3.3   | 116    1.4   6.69   | 404    3.3   41.6   | 770    5.3   118
LiveJournal (u)  3,997,962  34,681,189  32    | 1481   6.4   29.7   | 4,901  18    209    | 9,555  31.5  642



Figure 2: Multithreaded computation. Speedup (ratio of time
to single-thread time) as a function of the number of threads.

The table also displays the average relative errors (ARE) for each 
sketch parameter. Since it was not possible to scalably compute the 
exact reverse-rank values even for a single source, we computed 
instead the estimation errors on the ranks using the Dijkstra runs: 
The errors were therefore averaged over all the ranks provided by 
1000 different rankers instead of all the ranks “received” by 1000 
different rankees. We can see that the ARE, as expected, decreases
with the ADS parameter $k$ and is within the theoretical bounds.
Note that a fixed set of sketches was computed during preprocess¬ 
ing and used in all subsequent computations. Therefore the esti¬ 
mates on reverse-ranks of different source-destination pairs can be 
highly dependent. 

7.4 Reverse-rank distributions 

Our implementation allowed us, for the first time, to view the reverse-rank distributions of nodes in a large network. Figure 3 (left) shows the cumulative reverse-rank distributions $\hat{\overline{\pi}}_{js}$ for 4 selected source nodes in the YouTube network. For each node $s$, we sort the (estimated) values $\hat{\overline{\pi}}_{js}$ for all nodes $j$ in increasing order. The cumulative distribution plot then shows the value $y$ at each position $x$. The figure also includes a reference line where for any $i$ there is a node with rank $i$. The reference line in a sense corresponds to an “average” source node, which gives and receives the same influence.
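The construction of such a curve is simple enough to sketch in a few lines of Python. The helper names below are hypothetical, and the summary statistic (the fraction of positions at which the curve lies strictly below the reference line $y = x$) is only one possible way to read the plot; it is not a measure used in this evaluation.

```python
def reverse_rank_curve(ranks):
    """Pair each 1-based position x with the x-th smallest reverse-rank y."""
    return list(enumerate(sorted(ranks), start=1))

def fraction_below_reference(ranks):
    """Fraction of positions where the curve is strictly below the line y = x."""
    curve = reverse_rank_curve(ranks)
    return sum(1 for x, y in curve if y < x) / len(curve)
```

A source whose curve stays below the reference line over a long prefix is ranked highly by more nodes than an average source would be.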

We can get information on the relative importance of a node in its “locality,” for varying locality ranges, from its reverse-rank distribution. Nodes that are important in their locality would have distributions that at least initially lie well below the reference line. This means that for some $i$, there are many more than $i$ nodes that rank them below $i$. Nodes #2711 and #480 are examples of influential nodes that remain important across neighborhood scales. Node #368749 has low influence, with a distribution above the reference line across ranges. Node #3394 has low influence on most ranges except for its immediate neighborhood, where it has average influence, and on the longest scale, when looking at its $7 \times 10^5$ and above highest rankers (which is 35% of total nodes), which indicates that it lies closer to the “core” of the network. Note that we plot $\hat{\overline{\pi}}$, meaning ties are broken “upwards,” which biases towards being above the reference line.

Figure 3: Cumulative distributions on YouTube graph. Left: Reverse-rank. Right: Distance.

Figure 3 (right) provides, for comparison, the cumulative distance distributions for the same nodes: for each number of hops $y$, we see the number $x$ of nodes within $y$ hops. The distance distribution captures the expansion rate, but does not quantify well the relative status of a node within its locality: a less influential member of a dense community would have higher expansion than an influential member of a sparser community. As a simplified example, think of two nodes $A$ and $B$ with the same degree $\Delta$ such that all neighbors of $A$ have degree $\ll \Delta$ and all neighbors of $B$ have degree $\gg \Delta$. In this case we may view $A$ as being influential in its neighborhood whereas $B$ will not be. The reverse-rank distribution will correctly make this distinction whereas the distance distribution will not.

7.5 Influence maximization 

We next evaluate the performance of reverse-rank SKIM, in terms 
of both the quality of the coverage and the running time. The eval¬ 
uation used the social graphs listed in Table with uniform edge 
lengths and all nodes being both rankers and rankees. We used the 
TT = 7f 0 interpretation of rank. We study dependence on three 
parameters, T, k, and K: The threshold value T specifies the coverage
rate. The ADS sketch parameter k determines the quality of $\hat{\pi}$
as estimates of the true ranks $\pi$, and thus, the relation between
$\widehat{\mathrm{Inf}}$, which we optimize for, and the true Inf. Finally, the sample
size parameter K determines the quality of the coverage in terms
of the approximate influence $\widehat{\mathrm{Inf}}$. Larger K means that we are
more likely to select seeds with marginal $\widehat{\mathrm{Inf}}$ influence that is closer
to the maximum.

Recall that the computation of an exact greedy sequence, with
respect to either Inf or $\widehat{\mathrm{Inf}}$ (Algorithm]^, is $\tilde{O}(T|E|)$, where the
$\tilde{O}$ notation suppresses logarithmic factors. The computation of
reverse-rank SKIM uses ADS computation of $\tilde{O}(k|E|)$ and additional
computation with worst-case bound of $\tilde{O}(K|E|)$. Moreover,
note that in actuality, the time is $\tilde{O}(K|E|\rho)$, where $\rho$ is the ratio
between the average and maximum influence of a node, and for
typical skewed influence distribution, we have $\rho \ll 1$. Therefore,
we expect the scalability advantage of SKIM to become more
significant for larger T.

Table 3: Reverse-Rank Single Source computation, averaged over 1000 source nodes selected uniformly at random.

            reverse-rank single source                          Dijkstra   APSP
            k = 16           k = 64           k = 128
instance    [Sec]   ARE      [Sec]   ARE      [Sec]   ARE       [Sec]      [hours]
Facebook    0.006   0.11     0.006   0.072    0.006   0.067     0.002      0.002
Slashdot    0.071   0.34     0.077   0.12     0.074   0.076     0.06       1.3
Twitter     0.96    0.29     1.00    0.050    0.88    0.076     0.54       68

Figure 4: Fractional coverage Inf(S)/|V| as a function of seed
set size |S| for reverse-rank SKIM, for varying T. Panels: YouTube,
Twitter, LiveJournal.

When evaluating the effect of the sample size K, we fixed the
ADS parameter to be k = 128 for the four smaller graphs and
k = 64 for the larger one and used several values of T < 10"*. We
computed the exact greedy selection with respect to Inf, which is
obtained by selecting a node with maximum marginal Inf in each
step. This was done on the four smaller graphs. On these graphs,
sample size K = 100 was almost always within a fraction of a
percent of the exact greedy coverage. The one exception was on
Slashdot and T = 10"^, where the first seed had coverage that is
6% lower than the optimal one and the gap closed up for the first
three seeds. With K = 500, the approximate greedy coverage almost
exactly matched the exact greedy coverage. For LiveJournal,
we only evaluated the coverage for sample size up to K = 500,
but the performance with K = 100 already matched that. We
note that these observed errors are much lower than the worst-case
guarantees provided in our analysis. The explanation is that
the influence distribution is skewed: the node of maximum
marginal influence is well separated from the second maximum,
and very few nodes have influence that is more than a fraction
of the maximum.

Our implementation allows us to examine the coverage to seed
set size tradeoffs as a function of the threshold T. These tradeoffs
provide structural insights on the networks and results are shown
in Figure 4. Higher values of T, as expected, have higher coverage
with fewer seeds. We can also see a highly skewed and asymmetric
distribution of importance. For example, in the LiveJournal graph
with nearly four million nodes, there is a single node that 4x 10^ 
other nodes rank within their top T = 100. The first 11 nodes have
$1.6 \times 10^5$ nodes ranking at least one of them in their top 100. For
T = 1000, the top seed covers $3 \times 10^5$ nodes and the top 12 cover
$7.5 \times 10^5$ (a quarter of all nodes).
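To make the coverage notion used in these tradeoffs concrete, the following is a minimal brute-force sketch of exact greedy seed selection under a threshold T. It is not reverse-rank SKIM and uses no sketches. It assumes unit edge lengths, ties broken upwards, and that a seed also counts itself as covered; the function names (`top_t_sets`, `exact_greedy`, `build_example`) and the example graph are hypothetical.

```python
def top_t_sets(adj, T):
    """For each node v, return the nodes that v ranks within its top T.

    Assumes unit edge lengths and ties broken upwards: a whole BFS level is
    included only if all of it fits within the first T positions.
    """
    top = {}
    for v in adj:
        visited = {v}
        frontier = [v]
        chosen = set()
        while frontier:
            nxt = []
            for x in frontier:
                for w in adj[x]:
                    if w not in visited:
                        visited.add(w)
                        nxt.append(w)
            # Stop as soon as the next level would push ranks beyond T.
            if not nxt or len(chosen) + len(nxt) > T:
                break
            chosen.update(nxt)
            frontier = nxt
        top[v] = chosen
    return top


def exact_greedy(adj, T, max_seeds):
    """Greedy seed selection for threshold reverse-rank coverage."""
    # covers[u] = nodes that rank u within their top T, plus u itself (assumption).
    covers = {u: {u} for u in adj}
    for v, ranked in top_t_sets(adj, T).items():
        for u in ranked:
            covers[u].add(v)
    covered, seeds, curve = set(), [], []
    for _ in range(max_seeds):
        # Pick the node with maximum marginal coverage.
        best = max(adj, key=lambda u: len(covers[u] - covered))
        gain = covers[best] - covered
        if not gain:
            break  # every node is already covered
        seeds.append(best)
        covered |= gain
        curve.append(len(covered))
    return seeds, curve


def build_example():
    """A star with hub c and five leaves, plus a separate edge p-q."""
    adj = {}

    def add_edge(x, y):
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)

    for i in range(5):
        add_edge("c", f"l{i}")
    add_edge("p", "q")
    return adj


seeds, curve = exact_greedy(build_example(), T=1, max_seeds=3)
print(seeds, curve)
```

With T = 1, every leaf ranks the hub c first, so c is selected first and covers 6 of the 8 nodes; one more seed from the pair p, q completes the coverage. The list `curve` plays the role of the coverage-versus-seed-set-size curves discussed above.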

Table 4 lists selected single-thread running times for reverse-rank
SKIM. Listed times do not include ADS computation (see Table
[TJ, but this preprocessing time was only a fraction of the total. We
note that the running time did not significantly depend on ADS size
(the parameter k). The parameter k can impact running time only
because it can generate longer neighborhood estimate lists L(i).
The size of these lists, even with very large k, is below the effective
diameter of the graph, which was small in our data sets. The
listed times in the table use k = 128 for the four smaller graphs and
k = 64 for the largest one. They correspond to computing the full
sequence (until all nodes are covered). Note that the running time
can be significantly reduced if we stop when a desired coverage or
number of seeds is reached. We can also observe that the running
time grows linearly with the sample size K. An interesting observation
is that for the largest graphs, the computation is faster for
larger values of T. This is because SKIM works with the residual
problem, and its size decreases more rapidly for higher values of
T. This is in contrast to an exact greedy computation (Algorithm
1^, where the running time increases rapidly with T. Selected running
times, and the slowdown factor with respect to reverse-rank
SKIM (including ADS computation), are provided in Table 5. We
can see that for very small values of T the exact computation is feasible,
but for larger values of T, the running time and the slowdown
factor increase rapidly.

Finally, we evaluate the quality of our approximate greedy sequences,
which were optimized for $\widehat{\mathrm{Inf}}$, in terms of the exact influence
objective Inf. To do so, we used a variation of Algorithmic
to compute the exact influence of the sequence of seeds returned
by reverse-rank SKIM. We observed that even for ADS parameter
k = 64 and k = 128, the Inf coverage of the approximate greedy
sequence for $\widehat{\mathrm{Inf}}$ was consistently within 5% of the exact greedy
sequence for Inf, and typically much closer.

Table 5: Exact greedy running times and speedup factor of
reverse-rank SKIM (including sketch computation)

Instance       T      [hours]   factor
YouTube        10     1.50      x5
YouTube        10^    7.26      x23
YouTube        10^    20.74     x68
LiveJournal    10     3.01      x1
LiveJournal    10^    11.59     x4


8. CONCLUSION 

Rank-based measures were used for decades as an alternative to 
distance-based measures. Here, we defined and motivated rank- 
based measures of centrality and influence, which we believe will 
become important tools in network mining and analysis. We then 
presented novel highly scalable algorithms for fundamental rank-based
computations on graphs, including a Dijkstra-like approximate
reverse-rank single-source algorithm which facilitates reverse-rank
influence computation and reverse-rank greedy influence maximization.
We complement our work with hardness results that indicate
that exact computation inherently scales poorly.

A contribution we make that is of even broader interest is a novel 
multithreaded design for computing all-distance sketches (ADS) 
which provided the fastest implementation for computing these sketches 
on multi-core architectures. This design is relevant to many other 
applications of distance sketches. 

Going forward, we plan to extend our reverse-rank IM computation
to general decay functions, design a multithreaded implementation,
and open-source our implementations. We also hope to use
our newly available tools to explore and understand the relation
between distance-based and rank-based influence.

APPENDIX


Table 6: Computing T nearest neighbors for all nodes

instance       T = 16 [Sec]   T = 64 [Sec]   T = 128 [Sec]
Facebook               0.29           0.65            0.68
Slashdot               9.08          24.45           45.32
Twitter               96.78         222.48          412.54
YouTube            1,034.93       1,570.93        2,836.12
LiveJournal       10,589.59      12,885.12       19,183.58